A Simple Test of Using WhisperDesktop for Speech-to-Text

WARNING

This is a record of an earlier test. WhisperDesktop still works, but the developer has not updated it for a long time. I have since switched to Subtitle Edit integrated with Faster-Whisper, which is faster and more actively maintained; see the newer article instead: Using Subtitle Edit Integrated with Faster-Whisper for Local Speech-to-Text.

Some time ago, while looking into ChatRTX, I came across the term "Whisper." After some research, I discovered that OpenAI Whisper is a speech transcription and translation AI model released by OpenAI in September 2022. For more information, you can refer to the article What is OpenAI Whisper?.

For an AI beginner like me, setting up an environment to run this model from scratch is a bit difficult. However, someone has developed an offline tool that can be used directly: WhisperDesktop.

Download and Installation

  1. Click the latest release under "Releases" in the right sidebar of the GitHub repository homepage. At the time of writing, the latest version is 1.12.

    *(Screenshot: the WhisperDesktop GitHub Releases list)*

  2. In the "Assets" section of the Release page, click on WhisperDesktop.zip (highlighted in the red box) to download it.

    *(Screenshot: the WhisperDesktop.zip asset on the Release page)*

  3. After unzipping, you will see the following three files:

    • WhisperDesktop.exe: The executable file.
    • Whisper.dll: The library file.
    • lz4.txt: The license statement.

Downloading Models

Next, you need to download a model from the following page: Huggingface Whisper.

Model Sizes and Specifications

There are different model sizes available. Those with the .en suffix are English-only versions; there are also other extended models. The author of WhisperDesktop recommends using ggml-medium.bin, as it is the model they primarily use to test the software.

Size     Parameters  English-only  Multilingual  VRAM Required  Relative Speed
tiny     39 M        tiny.en       tiny          ~1 GB          ~32x
base     74 M        base.en       base          ~1 GB          ~16x
small    244 M       small.en      small         ~2 GB          ~6x
medium   769 M       medium.en     medium        ~5 GB          ~2x
large    1550 M      N/A           large         ~10 GB         1x
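As a quick way to reason about the trade-off, the table can be encoded as data and queried for the largest model that fits a given VRAM budget. This is a minimal sketch of my own (the helper name is mine, and the VRAM figures are the approximate values from the table, not anything WhisperDesktop exposes):

```python
# Approximate specs from the table above (VRAM in GB, speed relative to large).
MODELS = {
    "tiny":   {"params_m": 39,   "vram_gb": 1,  "rel_speed": 32},
    "base":   {"params_m": 74,   "vram_gb": 1,  "rel_speed": 16},
    "small":  {"params_m": 244,  "vram_gb": 2,  "rel_speed": 6},
    "medium": {"params_m": 769,  "vram_gb": 5,  "rel_speed": 2},
    "large":  {"params_m": 1550, "vram_gb": 10, "rel_speed": 1},
}

def largest_model_for(vram_gb: float) -> str:
    """Return the largest model (by parameter count) that fits the VRAM budget.

    Raises ValueError if even the smallest model does not fit.
    """
    fitting = [(m, s["params_m"]) for m, s in MODELS.items()
               if s["vram_gb"] <= vram_gb]
    if not fitting:
        raise ValueError("no model fits in the given VRAM budget")
    return max(fitting, key=lambda t: t[1])[0]

print(largest_model_for(6))   # medium fits in ~5 GB
```

With 6 GB of VRAM this picks `medium`; note that `base` outranks `tiny` even though both need roughly 1 GB.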

Usage

  1. Run WhisperDesktop.exe.

  2. Specify the location of the downloaded model in the "Model Path" field.

  3. Select GPU for "Model Implementation" (I am not sure what the other options are for, so I won't explain them here).

    • If your graphics card is not detected correctly, you can click advanced... to configure the settings in detail.

    *(Screenshot: the advanced GPU settings dialog)*

  4. Click ok.

  5. For "Language," select the primary language of the video (for Chinese, there is only a "Chinese" option; the program will automatically determine Traditional or Simplified, though I am not sure what criteria it uses).

  6. If you want to translate to English, check "Translate." However, I found that it often fails when testing with music.

  7. For "Transcribe File," select the audio or video file you want to transcribe.

  8. For "Output Format," you can choose from the following:

    • None: No output file.
    • Text file (.txt): Plain text file.
    • Text with timestamps: Text file with timestamps.
    • SubRip subtitles (.srt): Common subtitle format containing timecodes and text.
    • WebVTT subtitles (.vtt): Subtitle format for web videos.
  9. Specify the output location and filename.

    *(Screenshot: the output location field)*

  10. If you do not want to specify a custom output location, you can check Place that file to the input folder.

    • This will save the output file in the same location as the input file.
    • The filename will be the original filename plus the extension corresponding to the output format.
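Among the output formats in step 8, SubRip (.srt) is the most widely supported by video players. To illustrate what the format contains, here is a minimal sketch of an SRT cue writer (the function names are mine, not part of WhisperDesktop):

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time offset as an SRT timecode: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """One SubRip cue: sequence number, time range, text, trailing blank line."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"

print(srt_cue(1, 0.0, 2.5, "Hello, world."))
```

WebVTT (.vtt) is nearly identical, except the timecode uses a period instead of a comma for milliseconds and the file starts with a `WEBVTT` header line.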

The "Audio Capture" feature can directly read audio input from a microphone, but my computer could not detect my Bluetooth headset, so I will not explain this part.

Performance Test

Tested using a PNY RTX 4070 Ti Super 16GB Blower graphics card to convert a 5-minute and 16-second mp3 file:

  • Using ggml-large-v3.bin took 22 minutes and 1 second, and it did not always succeed (in one test the output file was blank; a different version of the large model might be needed to convert correctly).
  • Using ggml-medium.bin took only 11 seconds.

Tested using the integrated graphics of an i7-12700H (no dedicated GPU) with the same 5-minute-and-16-second mp3 file:

  • Using ggml-tiny.bin took 41 seconds.
  • Using ggml-small.bin took 4 minutes and 19 seconds.
  • Using ggml-medium.bin took 13 minutes and 5 seconds.

The accuracy of the transcribed text improved significantly as the model size increased.
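For perspective, the timings above can be converted to realtime factors (audio length divided by processing time; a factor above 1 means transcription runs faster than the audio plays). The numbers come straight from the tests in this section:

```python
# Duration of the test file: 5 minutes 16 seconds.
AUDIO_SECONDS = 5 * 60 + 16

# Processing times in seconds, taken from the measurements above.
runs = {
    "RTX 4070 Ti Super / medium":   11,
    "RTX 4070 Ti Super / large-v3": 22 * 60 + 1,
    "i7-12700H iGPU / tiny":        41,
    "i7-12700H iGPU / small":       4 * 60 + 19,
    "i7-12700H iGPU / medium":      13 * 60 + 5,
}

for name, secs in runs.items():
    print(f"{name}: {AUDIO_SECONDS / secs:.1f}x realtime")
```

On the dedicated GPU, medium runs at roughly 28.7x realtime, while on integrated graphics it drops below realtime (about 0.4x), which matches the recommendations below.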

Conclusion

Based on the test results and speed considerations, here are my personal recommendations:

  • For users with a dedicated graphics card: I recommend using the ggml-medium.bin model.
  • For users with integrated graphics or older graphics cards:
    • Daily use: Choose ggml-small.bin. This is the smallest acceptable model; the accuracy of the ggml-tiny.bin model is too poor.
    • Important transcriptions: You can choose ggml-medium.bin and accept the longer processing time to obtain higher accuracy.

Change Log

  • 2025-03-24 Initial document created.
  • 2026-01-31 Added recommendation link to the new Faster-Whisper solution.